1 Introduction

Studying the relationship between the information divergence $D$ and the variational distance $V$ or,
more specifically, determining lower bounds on $D$ in terms of $V$, has been of interest at least
since 1959, when Volkonskij and Rozanov showed that $D \ge V - \log(1 + V)$. The best known
result in this direction is usually referred to as Pinsker's inequality and states that $D \ge \frac12 V^2$.
In general, studying the relationship between $D$ and $V$ is important because it allows one to "...
translate results from information theory (results involving $D$) to results in probability theory
(results involving $V$) and vice versa" (Fedotov, Harremoës and Topsøe). For instance, Barron
[2] found a strengthened version of the central limit theorem by showing convergence in the sense
of relative entropy and then using Pinsker's inequality to conclude convergence in the variational
norm. In different settings, this idea has also been used by Topsøe and by Harremoës and Ruzankin
[3]. Interestingly, results of this kind and their relation with Gagliardo-Nirenberg and generalized
Sobolev inequalities have been used recently to obtain the decay rate of solutions of
nonlinear diffusion equations; see Del Pino and Dolbeault and references therein.

Pinsker's inequality was proved independently by Csiszár and Kemperman, building on
previous work by Pinsker, Kean and Csiszár. The constant $\frac12$ in $D \ge \frac12 V^2$ is best
possible, in the sense that there is a probability space and two sequences of probability measures
$P_n$ and $Q_n$ such that $D(P_n,Q_n)/V^2(P_n,Q_n) \downarrow \frac12$. Sharpened Pinsker type inequalities bounding
$D$ by higher-order polynomials in $V^2$ are also available. For instance, $D \ge \frac12 V^2 + \frac1{36} V^4$ (see
Kullback and Vajda), where again the constant $\frac1{36}$ is best possible, in the sense that
there are sequences $P_n$ and $Q_n$ such that $[D(P_n,Q_n) - \frac12 V^2(P_n,Q_n)]/V^4(P_n,Q_n) \downarrow \frac1{36}$. More
recently, Topsøe showed that $D \ge \frac12 V^2 + \frac1{36} V^4 + \frac1{270} V^6 + \frac{221}{340200} V^8$, while Fedotov et al.
have obtained a parametrization of the curve $v \mapsto L(v) = \inf\{D(P,Q) : V(P,Q) = v\}$ in terms
of hyperbolic trigonometric functions and argue that the best possible extended Pinsker
inequality contains terms up to and including $V^{48}$.

Let $P$ and $Q$ be probability measures on a measurable space $(\Omega, \mathcal{A})$ and $p$ and $q$ their densities
or Radon-Nikodym derivatives with respect to a common dominating measure $\mu$. The information
divergence is $D(P,Q) = \int p \log(p/q)\, d\mu$ and is also known as relative entropy or Kullback-Leibler
divergence. The variational (or $L^1$) distance is $V(P,Q) = \int |q - p|\, d\mu$. The $f$-divergence generated
by $f$ is $D_f(P,Q) = \int p f(q/p)\, d\mu$, where $f : (0, \infty) \to \mathbb{R}$ is convex and $f(1) = 0$. Jensen's inequality
implies that $D_f(P,Q) \ge 0$, with equality holding if and only if $P = Q$, provided that $f$ is strictly
convex at $u = 1$. Hence, $D_f(P,Q)$ can be thought of as a measure of discrepancy between $P$ and
$Q$. The class of $f$-divergences was introduced by Csiszár and by Ali and Silvey. It includes
many of the most popular distances and discrepancy measures between probability measures. Both
$D$ and $V$ belong to this class, respectively for $f(u) = -\log u$ and $f(u) = |u - 1|$. All of the
following are also $f$-divergences: the $\chi^2$ divergence $\chi^2(P,Q) = \int \frac{(q-p)^2}{p}\, d\mu = \int \frac{q^2}{p}\, d\mu - 1$, the
Hellinger discrimination $h^2(P,Q) = \frac12 \int (\sqrt{q} - \sqrt{p})^2\, d\mu = 1 - \int \sqrt{qp}\, d\mu$, the Triangular or Harmonic
divergence $\Delta(P,Q) = \int \frac{(q-p)^2}{p+q}\, d\mu = 4\int \frac{p^2}{p+q}\, d\mu - 2 = 2 - 4\int \frac{pq}{p+q}\, d\mu$, the Capacitory discrimination
$C(P,Q) = D(P,M) + D(Q,M)$ where $M = (P + Q)/2$, and Jeffreys' divergence $J(P,Q) =
D(P,Q) + D(Q,P) = \int (q - p)\log(q/p)\, d\mu$. A convenient one-parameter family which includes
many of the above as special cases is generated by the convex functions $f(u) = [\alpha(\alpha - 1)]^{-1}(u^\alpha - 1)$
($\alpha \ne 0, 1$). The resulting divergence $D_{(\alpha)}(P,Q) = [\alpha(\alpha - 1)]^{-1}[\int q^\alpha p^{1-\alpha}\, d\mu - 1]$ is called relative
information of type $(1 - \alpha)$ by Vajda and Taneja. It is easy to check that $\chi^2 = 2D_{(2)}$,
$4h^2 = D_{(1/2)}$ and $D = \lim_{\alpha \to 0} D_{(\alpha)}$. The Tsallis and the Cressie-Read divergences, which are used
extensively in many areas including physics, economics and statistics, are respectively $T_\alpha = \alpha D_{(1-\alpha)}$
and $CR_\lambda = D_{(-\lambda)}$. Finally, Rényi's information gain of order $\alpha > 0$, $I_\alpha(P,Q) =
(\alpha - 1)^{-1} \log[\int p^\alpha q^{1-\alpha}\, d\mu]$, of which the information divergence is also a special case as $\alpha \to 1$,
although not itself an $f$-divergence, can be expressed as $I_\alpha = (\alpha - 1)^{-1} \log[1 - \alpha(1 - \alpha)D_{(1-\alpha)}]$.
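The definitions above are easy to try out on finite probability vectors. The following is a minimal sketch, assuming discrete distributions with strictly positive entries so that the integrals become sums; the function names (`f_divergence`, `rel_info`, `renyi` and so on) are illustrative and do not come from the paper.

```python
import math

def f_divergence(f, p, q):
    """D_f(P, Q) = sum_i p_i f(q_i / p_i) for strictly positive discrete p, q."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def kl(p, q):
    # Information divergence D(P, Q), generated by f(u) = -log u
    return f_divergence(lambda u: -math.log(u), p, q)

def variational(p, q):
    # Variational distance V(P, Q), generated by f(u) = |u - 1|
    return f_divergence(lambda u: abs(u - 1.0), p, q)

def chi2(p, q):
    return f_divergence(lambda u: (u - 1.0) ** 2, p, q)

def hellinger2(p, q):
    return f_divergence(lambda u: 0.5 * (math.sqrt(u) - 1.0) ** 2, p, q)

def rel_info(alpha, p, q):
    # Relative information D_(alpha), alpha not in {0, 1}
    return f_divergence(lambda u: (u ** alpha - 1.0) / (alpha * (alpha - 1.0)), p, q)

def renyi(alpha, p, q):
    # Renyi information gain I_alpha(P, Q); not itself an f-divergence
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1.0)

# Arbitrary example vectors (an assumption made only for illustration)
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
```

With these vectors one can check numerically the identities $\chi^2 = 2D_{(2)}$ and $4h^2 = D_{(1/2)}$, the expression of $I_\alpha$ through $D_{(1-\alpha)}$, and Pinsker's inequality itself.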

Regarding the relationship between $f$-divergences and $V$, bounds are available for some special
cases involving divergences which are more or less easy to manipulate. For instance, it is
known that $\chi^2 \ge V^2$, $\frac12 V^2 \le \Delta \le V$ and $\frac18 V^2 \le h^2 \le \frac12 V$; see Dacunha-Castelle [23], LeCam
[24], Dragomir, Gluščević and Pearce [25] and Topsøe [26]. A precise bound is available for the
Capacitory divergence, for which Topsøe showed that
$$C(P,Q) \ge \sum_{n=1}^{\infty} [n(2n-1)2^{2n}]^{-1} V^{2n} = \log\frac{4 - V^2}{4} + \frac{V}{2}\log\frac{2+V}{2-V}.$$

Although we know of no general result giving a lower bound for $f$-divergences in terms of $V$,
it seems intuitively clear to us, from the fact that $f$-divergences share many of the properties
of $D$, that inequalities similar to Pinsker's should also hold for other divergences. This should be
the case, for instance, of relative information of type $(1 - \alpha)$ with $\alpha$ close to zero or of Rényi's
information gain of order $\alpha$ with $\alpha$ close to one. Maybe the closest to a general statement giving a
kind of lower bound for an arbitrary Df in terms of V is in Csiszar |271 Theorem 1], which states 

that Df(P,Q) < e implies, for sufficiently small e, that ^^-V 2 (P,Q) < e (cf. our Theorem 
below), implying then that V should be small whenever Df is small enough. 

This paper is the first of a series dealing with the relationship between $f$-divergences
and variational distance. In particular, our objective here is to discuss conditions under which an
$f$-divergence satisfies either a Pinsker type inequality $D_f \ge c_f V^2$ or a fourth-order inequality
$D_f \ge c_{2,f} V^2 + c_{4,f} V^4$. We will show in Section 3 that a sufficient condition for $D_f \ge c_f V^2$ is
that the ratio between $(u - 1)^2$ and the difference between the generating $f$ and its tangent at
$u = 1$ be upper bounded by a straight line $a + bu$ with nonnegative coefficients $a$ and $b$ such
that $a + b = c_f^{-1} = 2/f''(1)$, if we want $c_f$ to be best possible. A sufficient condition for having a
fourth-order inequality, always with best possible coefficients, is presented in Section 4. Each of
these theorems is followed by a corollary which gives conditions on the derivatives of $f$ which are
easier to check in practice than the original conditions on $f$. As a consequence of these we show in
Section 3 that the relative information of type $(1 - \alpha)$ satisfies $D_{(\alpha)} \ge \frac12 V^2$ whenever $-1 \le \alpha \le 2$,
$\alpha \ne 0, 1$. This inequality is improved in Section 4 to $D_{(\alpha)} \ge \frac12 V^2 + \frac1{72}(\alpha + 1)(2 - \alpha)V^4$. Using that
$I_\alpha = (\alpha - 1)^{-1} \log[1 - \alpha(1 - \alpha)D_{(1-\alpha)}]$, we also obtain that Rényi's information gain of order
$\alpha$ satisfies $I_\alpha \ge \frac{\alpha}{2} V^2 + \frac1{36}\alpha(1 + 5\alpha - 5\alpha^2) V^4$ whenever $0 < \alpha < 1$.

Besides Sections 3 and 4, which deal respectively with second and fourth-order inequalities, the rest
of the paper is organized as follows. Section 2 introduces some additional notation and states
a fundamental inequality between powers of $V$ and $D_f$ in Corollary 2. In Section 5 we bring
forward an argument from the sequel of this paper and briefly discuss why the tools that we use
here to obtain second and fourth-order inequalities are insufficient to obtain sixth and higher-order
inequalities when we are interested in best possible coefficients. Some technical results needed in
Section 4 are presented in an Appendix. Finally, since some of the proofs in Section 4 and in the
Appendix require somewhat lengthy calculations, we have recorded a MAPLE script that could
help the reader interested in checking them. Although the script is not included here for reasons of
space, it is available from us on request. Notwithstanding, we stress that we have included in the
paper what we believe are full and complete proofs for all statements made.
2 Notation and preliminary considerations

Throughout, equalities or inequalities between divergences will be understood to hold for every
pair of probability measures, so that we will write for instance "$D \ge \frac12 V^2$" instead of "$D(P,Q) \ge
\frac12 V^2(P,Q)$ for every $P, Q$". An equivalent definition for the variational distance is $V(P,Q) =
2\sup\{|Q(A) - P(A)| : A \in \mathcal{A}\} = 2[Q(B) - P(B)]$, where $B = \{\omega \in \Omega : q(\omega) > p(\omega)\}$. Hence
$0 \le V(P,Q) \le 2$, with equality holding respectively if and only if $P = Q$ or $P \perp Q$. It is well known
that the information divergence satisfies $0 \le D(P,Q) \le +\infty$. $D(P,Q) = 0$ can occur if and only
if $P = Q$, while $P \not\ll Q$ implies that $D(P,Q) = +\infty$, although the converse does not hold.

To avoid unnecessary discussion, we will assume the usual conventions $f(0) = \lim_{u \downarrow 0} f(u)$,
$0 \cdot f(0/0) = 0$ and $0 \cdot f(a/0) = \lim_{\epsilon \downarrow 0} \epsilon f(a/\epsilon) = a \lim_{u \to +\infty} f(u)/u$. An $f$-divergence does not
determine uniquely the associated $f$. Indeed, for any fixed $a$, $D_f$ and $D_{f - a(u-1)}$ are identical.
For instance, $D = D_{-\log u} = D_{u - 1 - \log u}$. For any convex $f$ we will let $\tilde f(u) = f(u) - f'(1)(u - 1)$,
which is nonnegative due to convexity considerations (more precisely, $f'(1)$ can be taken to be
any number between the left and right derivatives of $f$ at $u = 1$). We note here that second and
higher-order derivatives of $f$ and $\tilde f$ coincide. Indeed, we will often switch from one to the other in
Sections 3 and 4.

In general, $f$-divergences are not symmetric, in the sense that $D_f(P,Q)$ does not necessarily
equal $D_f(Q,P)$, unless the generating $f$ satisfies $f(u) = u f(1/u) + a(u - 1)$ for some fixed $a$.
This is the case for instance of $V$ and the $h^2$ and $\Delta$ divergences, but not of $D$ or $\chi^2$. Whenever an
$f$-divergence is not symmetric we can define the reversed divergence by letting $f_R(u) = u f(1/u)$,
so that $D_{f_R}(P,Q) = D_f(Q,P)$. Similarly, beginning from an arbitrary $f$-divergence it is possible to
construct a symmetric measure by using the convex function $f_S(u) = f(u) + f_R(u)$. For instance,
the reversed information divergence is $D_R(P,Q) = \int q \log(q/p)\, d\mu$ and its symmetrized version is
the already mentioned Jeffreys' divergence.

The following lemma is slightly more general than what we will actually need in Sections 3 and 4.
It gives an upper bound on $|\int g q\, d\mu - \int g p\, d\mu| = |E_Q g - E_P g|$ in terms of a certain higher moment
of $g$ and an $f$-divergence $D_f$. If we interpret $Q$ as an approximation to $P$, then $|E_Q g - E_P g|$ is
the error between the corresponding approximate and actual expectations. The lemma generalizes
results which we have used in [28, 29] in order to obtain upper limits for the approximation error
in the context of Bayesian statistics.

Lemma 1 Let $g$ be both $P$ and $Q$ integrable, $k(u) \ge 0$ such that $\int p\, k(q/p)\, d\mu = 1$, and $n > 1$.
Then for any fixed $a$
$$\Big|\int g q\, d\mu - \int g p\, d\mu\Big|^n \le \sup_{u > 0,\, u \ne 1} \Big\{\frac{|u - 1|^n}{\tilde f(u)\, k^{n-1}(u)}\Big\} \cdot \big[E_r |g - a|^{n/(n-1)}\big]^{n-1} \cdot D_f(P,Q), \tag{1}$$
where $r = p\, k(q/p)$, $E_r|g - a|^{n/(n-1)} = \int |g - a|^{n/(n-1)} r\, d\mu$ is the $[n/(n-1)]$-th moment of $g$ around
$a$ with respect to the probability density $r$ and, as before, $\tilde f(u) = f(u) - f'(1)(u - 1)$ and $f'(1)$ is a
number between the left and the right derivative of $f$ at $u = 1$.

Proof: Let $m = n/(n - 1)$ (so that $n$ and $m$ are conjugate) and $C = \{\omega : q(\omega) \ne p(\omega)\}$. Then
we have for any real $a$ that
$$\begin{aligned}
\Big|\int g q\, d\mu - \int g p\, d\mu\Big| &= \Big|\int_C (g - a)(q - p)\, d\mu\Big| \le \int_C |g - a|\,|q - p|\, d\mu \\
&= \int_C \frac{|g - a|\,|q/p - 1|}{\tilde f^{1/n}(q/p)}\, \tilde f^{1/n}(q/p)\, p\, d\mu \\
&\le \Big[\int_C \frac{|g - a|^m |q/p - 1|^m}{\tilde f^{m/n}(q/p)}\, p\, d\mu\Big]^{1/m} [D_f(P,Q)]^{1/n} \\
&= \Big[\int_C \frac{|g - a|^m |q/p - 1|^m}{\tilde f^{m/n}(q/p)\, k(q/p)}\, k(q/p)\, p\, d\mu\Big]^{1/m} [D_f(P,Q)]^{1/n} \\
&\le \Big[\sup_{u > 0,\, u \ne 1} \frac{|u - 1|^m}{\tilde f^{m/n}(u)\, k(u)}\Big]^{1/m} \Big[\int |g - a|^m r\, d\mu\Big]^{1/m} [D_f(P,Q)]^{1/n},
\end{aligned}$$
where the second inequality follows from Hölder's inequality. The desired result now follows after
taking the $n$-th power in the leftmost and rightmost terms and noting that
$\sup_{u > 0, u \ne 1}\{|u - 1|^m / \tilde f^{m/n}(u) k(u)\}^{n/m} = \sup_{u > 0, u \ne 1}\{|u - 1|^n / \tilde f(u) k^{n/m}(u)\} = \sup_{u > 0, u \ne 1}\{|u - 1|^n / \tilde f(u) k^{n-1}(u)\}$.
□

Taking $g = I_{p > q}$ and $a = \frac12$, the left hand side of (1) becomes $2^{-n}|V(P,Q)|^n$, while
$E_r|g - a|^{n/(n-1)} = \int |I_{p > q} - \frac12|^{n/(n-1)} r\, d\mu \le 2^{-n/(n-1)}$. Hence, we have the following corollary.



Corollary 2 Let $k$, $f$ and $\tilde f$ be as before. Then
$$V^n(P,Q) \le \sup_{u > 0,\, u \ne 1} \Big\{\frac{|u - 1|^n}{\tilde f(u)\, k^{n-1}(u)}\Big\}\, D_f(P,Q). \tag{2}$$



Remark 1. These results are still valid for nonconvex $f$ provided that $\tilde f(u) \ge 0$ and we interpret
$D_f(P,Q) = \int \tilde f(q/p)\, p\, d\mu$.

Remark 2. Although (1) holds for any $a$, we usually would like to use a value for which
$E_r|g - a|^{n/(n-1)}$ is small. Taking $a = E_r g = \int g r\, d\mu$ could be a good idea.

Remark 3. A more precise formulation would use the $L^\infty$ norm (or essential supremum) of
$(u - 1)^n / \tilde f(u) k^{n-1}(u)$ with respect to the measure $r\, d\mu$, instead of the supremum for $u > 0$, $u \ne 1$.

Of particular interest will be the cases $n = 2$ and $n = 4$, in which equation (2) becomes
$$V^2(P,Q) \le \sup_{u > 0,\, u \ne 1} \Big\{\frac{(u - 1)^2}{\tilde f(u)\, k(u)}\Big\}\, D_f(P,Q) \tag{3}$$
and
$$V^4(P,Q) \le \sup_{u > 0,\, u \ne 1} \Big\{\frac{(u - 1)^4}{\tilde f(u)\, k^3(u)}\Big\}\, D_f(P,Q). \tag{4}$$



Some interesting inequalities follow directly from (3). For instance, taking (i) $k(u) = 1$ and
$f(u) = (u - 1)^2$ we obtain that $\chi^2 \ge V^2$; (ii) $k(u) = (1 + u)/2$ and $f(u) = (u - 1)^2/(1 + u)$, so that
$(u - 1)^2/\tilde f(u)k(u) = 2$, it follows that $\Delta \ge \frac12 V^2$; and (iii) $k(u) = (1 + \sqrt{u})^2/[2(2 - h^2(P,Q))]$ and
$f(u) = \frac12(\sqrt{u} - 1)^2$, so that $(u - 1)^2/\tilde f(u)k(u) = 4[2 - h^2(P,Q)]$, we obtain that $4h^2(2 - h^2) \ge V^2$
(see Kraft [30], cited in Dragomir et al. [25]). Although we are not specially interested here in
$f$-divergences for which $f''(1) = 0$, Corollary 2 also gives some bounds for this case. For instance, for
the Triangular Divergence of order $\nu > 1$, $\Delta_\nu(P,Q) = \int \frac{(q - p)^{2\nu}}{(p + q)^{2\nu - 1}}\, d\mu$ (see Topsøe [26] or Dragomir
et al. [25]), we obtain after taking $n = 2\nu$, $f(u) = (u - 1)^{2\nu}/(1 + u)^{2\nu - 1}$ and $k(u) = (1 + u)/2$ in (2),
for which the supremum equals $2^{2\nu - 1}$, that $\Delta_\nu \ge 2^{1 - 2\nu} V^{2\nu}$.
Figure 1: $h_w(u) = (u - 1)^2/\{(u - 1 - \log u)[w + (1 - w)u]\}$ satisfies (i) $h_w(1) = 2$ for every $w$ (by
continuity) and (ii) when $w$ is greater (smaller) than $\frac13$, $h_w$ attains its maximum for a $u$ greater
(respectively smaller) than 1; hence for $w \ne \frac13$, the maximum value of $h_w$ is greater than 2.



3 Second-order inequalities 

We first note that Pinsker's inequality also follows from (3). To see this, take $\tilde f(u) = u - 1 - \log u$
and $k(u) = (1 + 2u)/3$ to obtain that $V^2 \le \sup_{u > 0, u \ne 1} h_{1/3}(u) \cdot D$, where $h_{1/3}(u) = 3(u - 1)^2/[(u -
1 - \log u)(1 + 2u)]$ (the reason for the subindex 1/3 here will be made clear shortly). Now note that
$\sup_{u > 0, u \ne 1} h_{1/3}(u) = 2$ because $\lim_{u \to 1} h_{1/3}(u) = 2$, while $g(u) = 2(u - 1 - \log u)(1 + 2u) - 3(u - 1)^2 \ge 0$
for $u > 0$ since $g(1) = g'(1) = 0$ and $g''(u) = 2(u - 1)^2/u^2 \ge 0$. Observe that using $k(u) =
(1 + 2u)/3$ in the previous argument amounts to using the mixture $k(q/p)p = \frac13 p + \frac23 q$ in the
proof of Lemma 1. It is interesting to note the reason why this and only this mixture works. Using
a mixture $wp + (1 - w)q$ for some $0 \le w \le 1$, $w \ne \frac13$, is equivalent to taking $k(u) = 1 + (1 - w)(u - 1)$
in equation (3). Now, define $h_w(u) = (u - 1)^2/\{(u - 1 - \log u)[1 + (1 - w)(u - 1)]\}$ and observe that
$\lim_{u \to 1} h_w(u) = 2$ for any $w$, so that $\sup_{u > 0, u \ne 1} h_w(u) \ge 2$, while $h_w(u)$ attains its maximum value
at $u = 1$ if and only if $w = \frac13$ (see Figure 1). Hence, it follows that $\sup_{u > 0, u \ne 1} h_w(u) > 2$ whenever
$w \ne \frac13$, and using any mixture with $w \ne \frac13$ in Lemma 1 will produce a less than optimal inequality
$D \ge cV^2$ with $c < \frac12$.
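The role of the weight $w = \frac13$ can be explored numerically. The sketch below evaluates $h_w$ on a logarithmic grid, which is only a crude stand-in for the supremum over $u > 0$; the grid limits, the step count and the small window around $u = 1$ where the limit value 2 is returned are assumptions chosen for illustration.

```python
import math

def h(w, u):
    """h_w(u) = (u-1)^2 / {(u - 1 - log u) [w + (1-w) u]}, set to its limit 2 near u = 1."""
    if abs(u - 1.0) < 1e-4:
        return 2.0
    return (u - 1.0) ** 2 / ((u - 1.0 - math.log(u)) * (w + (1.0 - w) * u))

def sup_h(w, lo=1e-3, hi=1e3, steps=20001):
    # Maximum of h_w over a log-spaced grid on [lo, hi]
    grid = (lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps))
    return max(h(w, u) for u in grid)

for w in (0.2, 1.0 / 3.0, 0.5):
    print(w, sup_h(w))
```

On such a grid the maximum stays at 2 for $w = \frac13$ and exceeds 2 for the other two weights, in line with the discussion of Figure 1.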

The idea in the previous paragraph can be formulated for arbitrary $f$-divergences. Let $h_w(u) =
(u - 1)^2/\{\tilde f(u)[1 + (1 - w)(u - 1)]\}$ and $b_w(u) = 1/h_w(u)$ and suppose that $f(u) = f'(1)(u -
1) + \frac12 f''(1)(u - 1)^2 + \frac16 f'''(1)(u - 1)^3 + o(|u - 1|^3)$ with $f''(1) > 0$. Equation (3) implies now that
$D_f \ge [\sup_{u > 0, u \ne 1} h_w(u)]^{-1} V^2$ for every $0 \le w \le 1$. Note that $\lim_{u \to 1} h_w(u) = 2/f''(1)$ for every
$w$, so that $[\sup_{u > 0, u \ne 1} h_w(u)]^{-1} \le f''(1)/2$. In order to obtain the inequality $D_f \ge \frac{f''(1)}{2} V^2$, we must
find a $w = w_f$ such that $[\sup_{u > 0, u \ne 1} h_{w_f}(u)]^{-1} = f''(1)/2$. In other words, the (continuity corrected)
function $h_{w_f}$ should be maximized at $u = 1$, or equivalently $b_{w_f} = 1/h_{w_f}$ should be minimized at
$u = 1$. It is easy to check that $b_w(u) = \frac{f''(1)}{2} + \big[\frac{f''(1)}{2}(1 - w) + \frac{f'''(1)}{6}\big](u - 1) + o(|u - 1|)$. Hence, for
$b_{w_f}$ to have a minimum at $u = 1$, the first-order term should vanish and hence $w_f = 1 + \frac13 \frac{f'''(1)}{f''(1)}$. Finally,
for $h_{w_f}(u)$ to be actually maximized at $u = 1$, we must have that $h_{w_f}(u) \le h_{w_f}(1) = 2/f''(1)$ and
hence that
$$\tilde f(u)[1 + (1 - w_f)(u - 1)] \ge \frac{f''(1)}{2}(u - 1)^2. \tag{5}$$

This leads to the following theorem. 

Theorem 3 (Pinsker type inequality for $f$-divergences). Suppose that the convex function
$f$ is differentiable up to order 3 at $u = 1$ with $f''(1) > 0$, and let $w_f = 1 + \frac13 \frac{f'''(1)}{f''(1)}$. Then (5) implies
that $D_f \ge \frac{f''(1)}{2} V^2$. The constant $c_f = \frac{f''(1)}{2}$ is best possible.

Proof: Although a rigorous proof can be obtained following the ideas above, the following
argument is easier once the right condition has been identified. First, note that for (5) to hold we
must have that $[1 + (1 - w_f)(u - 1)] > 0$ for every $u > 0$. Hence, (5) implies that
$$D_f(P,Q) = D_{\tilde f}(P,Q) \ge \frac{f''(1)}{2} \int \frac{(q/p - 1)^2}{1 + (1 - w_f)(q/p - 1)}\, p\, d\mu \ge \frac{f''(1)}{2} V^2(P,Q),$$
where the last inequality follows from (3) after taking $f(u) = (u - 1)^2/[1 + (1 - w_f)(u - 1)]$ and
$k(u) = [1 + (1 - w_f)(u - 1)]$.

To show that $c_f$ is best possible, consider a binary space and suppose that $P$ assigns probabilities
$p > 0$ and $(1 - p) > 0$ to each point of $\Omega$, say $P = (p, 1 - p)$. For small $v$ define $Q_v = (p +
v/2, 1 - p - v/2)$, so that $V(P,Q_v) = v$ and
$$D_f(P,Q_v) = p \tilde f\Big(1 + \frac{v}{2p}\Big) + (1 - p) \tilde f\Big(1 - \frac{v}{2(1 - p)}\Big) = p \frac{f''(1)}{2}\Big(\frac{v}{2p}\Big)^2 + (1 - p)\frac{f''(1)}{2}\Big[\frac{v}{2(1 - p)}\Big]^2 + o(v^2) = \frac{f''(1)}{2}[4p(1 - p)]^{-1} v^2 + o(v^2).$$
Hence $\lim_{v \to 0} D_f(P,Q_v)/V^2(P,Q_v) = \frac{f''(1)}{2}[4p(1 - p)]^{-1}$. Taking $p = \frac12$ completes the proof. □
Remark. The last part of the proof suggests that, when there is no set $A$ with $P(A) = \frac12$,
a better constant $c_f(P) = \frac{f''(1)}{2}[4 \sup\{P(A)[1 - P(A)] : A \in \mathcal{A}\}]^{-1} > c_f$ can be found so
that $D_f(P,Q) \ge c_f(P)V^2(P,Q)$ for every $Q$. This problem has been addressed recently for the
information divergence by Ordentlich and Weinberger [31].
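The binary construction used in the last part of the proof can be reproduced numerically. The following sketch, with illustrative function names that are not from the paper, evaluates $D_f(P,Q_v)/v^2$ for small $v$ and compares it with $\frac{f''(1)}{2}[4p(1-p)]^{-1}$ for two generators.

```python
import math

def binary_divergence(f, p, v):
    """D_f(P, Q_v) for P = (p, 1 - p) and Q_v = (p + v/2, 1 - p - v/2)."""
    return p * f(1.0 + v / (2.0 * p)) + (1.0 - p) * f(1.0 - v / (2.0 * (1.0 - p)))

def kl_generator(u):
    # u - 1 - log u, the tangent-corrected generator of D; f''(1) = 1
    return u - 1.0 - math.log(u)

def hellinger_generator(u):
    # (1/2)(sqrt(u) - 1)^2, generator of h^2; f''(1) = 1/4
    return 0.5 * (math.sqrt(u) - 1.0) ** 2

def ratio(f, p, v):
    # V(P, Q_v) = v, so this is D_f / V^2
    return binary_divergence(f, p, v) / v ** 2

print(ratio(kl_generator, 0.5, 1e-3))         # close to 1/2
print(ratio(kl_generator, 0.3, 1e-3))         # close to 1 / (8 * 0.3 * 0.7)
print(ratio(hellinger_generator, 0.5, 1e-3))  # close to 1/8
```

For $p \ne \frac12$ the limiting ratio is larger than $c_f$, which is the observation behind the remark on $c_f(P)$.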

For most divergences the condition in the next corollary is easier to check than (5).

Corollary 4 Let $f$ and $\tilde f$ be as before, $w_f = 1 + \frac13 \frac{f'''(1)}{f''(1)}$, and suppose that $f$ is three times
differentiable with $f''(u) > 0$ for all $u$. Then
$$\mathrm{sgn}(u - 1) \Big\{ \frac{f'''(u)}{f''(u)} [1 + (1 - w_f)(u - 1)] + 3(1 - w_f) \Big\} \ge 0 \tag{6}$$
implies (5) and hence that $D_f \ge \frac{f''(1)}{2} V^2$.

Proof: Let $g(u) = \tilde f(u)[1 + (1 - w_f)(u - 1)] - \frac{f''(1)}{2}(u - 1)^2$. Now $g(1) = 0$, $g'(u) = \tilde f'(u)[1 +
(1 - w_f)(u - 1)] + (1 - w_f)\tilde f(u) - f''(1)(u - 1)$ and hence $g'(1) = 0$, $g''(u) = f''(u)[1 + (1 - w_f)(u -
1)] + 2(1 - w_f)\tilde f'(u) - f''(1)$ and therefore $g''(1) = 0$, and finally $g'''(u) = f'''(u)[1 + (1 - w_f)(u -
1)] + 3(1 - w_f)f''(u)$. Hence, (6) implies that $g'''(u) \le 0$ for $u < 1$ and $g'''(u) \ge 0$ for $u > 1$, so
Lemma 5 below (with $n = 2$) implies that $g$ must be nonnegative, which is equivalent to (5). □

Remark. Since we also have that $g'''(1) = 0$, (5) is also implied if $g^{(4)}(u) = f^{(4)}(u)[1 + (1 - w_f)(u -
1)] + 4(1 - w_f)f'''(u) \ge 0$, but we usually find (6) easier to check.
Lemma 5 Let $n \ge 1$ and $g : (0, \infty) \to \mathbb{R}$ be $(n + 1)$ times differentiable with $g(1) = g'(1) =
\cdots = g^{(n)}(1) = 0$, where $g^{(n)}$ is the $n$-th derivative of $g$, and suppose that either (i) $n$ is even and
$g^{(n+1)}(u) \le 0$ for $u < 1$ and $g^{(n+1)}(u) \ge 0$ for $u > 1$ or (ii) $n$ is odd and $g^{(n+1)}(u) \ge 0$ for every $u$.
Then $g(u) \ge 0$ for every $u$.

Proof: We prove first the case that $n$ is odd. Since $g^{(n+1)}(u) \ge 0$ it follows that $g^{(n-1)}$ is
convex, and since $g^{(n-1)}(1) = (g^{(n-1)})'(1) = 0$, it must have a minimum at $u = 1$, hence $g^{(n-1)}$ must be
nonnegative. Repeat the argument backwards to obtain that $g^{(n-3)}, \ldots, g$ are also nonnegative.
Now in the case (i) that $n$ is even, note that $g^{(n)}$ decreases for $u < 1$ and increases for $u > 1$ and,
since $g^{(n)}(1) = 0$, it must be nonnegative, which reduces to the previous case. □

In our view the most important application of Corollary 4 is to the relative information of type
$(1 - \alpha)$ and Rényi's information gain of order $\alpha$.

Corollary 6 $D_{(\alpha)} \ge \frac12 V^2$ whenever $-1 \le \alpha \le 2$, $\alpha \ne 0, 1$. Also, $I_\alpha \ge \frac{\alpha}{2} V^2$ for $0 < \alpha < 1$. In
both cases the coefficient of $V^2$ is best possible.

Proof: For $D_{(\alpha)}$ we have $f(u) = [\alpha(\alpha - 1)]^{-1}(u^\alpha - 1)$. It is easy to check that $f''(1) = 1$,
$w_f = (\alpha + 1)/3$ and the left hand side of (6) is $(\alpha + 1)(2 - \alpha)|u - 1|/3u$, which satisfies the condition
for $-1 \le \alpha \le 2$.

Now, for $0 < \alpha < 1$, write
$$I_\alpha = (1 - \alpha)^{-1} \log\Big[1 + \frac{\alpha(1 - \alpha)D_{(1-\alpha)}}{1 - \alpha(1 - \alpha)D_{(1-\alpha)}}\Big]$$
and use that $\log(1 + x) \ge \frac{2x}{2 + x}$ for $x \ge 0$ (cf. Topsøe [32]) to get that
$$I_\alpha \ge \frac{2\alpha D_{(1-\alpha)}}{2 - \alpha(1 - \alpha)D_{(1-\alpha)}} \ge \alpha D_{(1-\alpha)} \ge \frac{\alpha}{2} V^2.$$ □

Remark 1. Pinsker's inequality can be seen as the limiting case of the inequality just stated
for $D_{(\alpha)}$ as $\alpha \to 0$, or equivalently of that stated for $I_\alpha$ as $\alpha \uparrow 1$.

Remark 2. The behavior of $I_\alpha$ for $\alpha > 1$ is somewhat puzzling to us. We will show in Section 4
(cf. Theorem 7 and Corollary 9) that for any $-1 \le \alpha \le 2$ there are probability measures $P_v$ and
$Q_v$ such that $V(P_v, Q_v) = v$ and $D_{(1-\alpha)}(P_v, Q_v) = \frac12 v^2 + \frac1{72}(\alpha + 1)(2 - \alpha) v^4 + o(v^4)$. Hence, we must
have that $I_\alpha(P_v, Q_v) = (\alpha - 1)^{-1} \log[1 - \alpha(1 - \alpha)D_{(1-\alpha)}(P_v, Q_v)] = \frac12 \alpha v^2 + \frac1{36}\alpha(1 + 5\alpha - 5\alpha^2)v^4 + o(v^4)$.
Since $1 + 5\alpha - 5\alpha^2 < 0$ for $\alpha > \frac12 + \frac{3}{10}\sqrt5 \approx 1.17$, it follows that in this case we cannot have that
$I_\alpha \ge \frac12 \alpha V^2$. We do not know whether the inequality holds for $1 < \alpha \le \frac12 + \frac{3}{10}\sqrt5$.
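The threshold in Remark 2 and the failure of $I_\alpha \ge \frac12\alpha V^2$ beyond it can be illustrated on a two-point space. The sketch below is hedged: the specific pair $P_v = (p, 1-p)$, $Q_v = (p + v/2, 1 - p - v/2)$ with $p = \frac12 - (1 + \alpha)v/6$ is an assumption of this example (it is the choice that minimizes the $v^4$ coefficient of $D_{(1-\alpha)}$ over shifts of $p$ linear in $v$), and the function names are illustrative.

```python
import math

def renyi_gain(alpha, p, q):
    """I_alpha(P, Q) = log(sum p_i^alpha q_i^(1 - alpha)) / (alpha - 1) for discrete P, Q."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def v4_coefficient(alpha):
    # Coefficient of v^4 in the expansion of I_alpha(P_v, Q_v)
    return alpha * (1.0 + 5.0 * alpha - 5.0 * alpha ** 2) / 36.0

# Larger root of 1 + 5a - 5a^2 = 0
alpha_star = 0.5 + 0.3 * math.sqrt(5.0)

def binary_pair(alpha, v):
    # Assumed near-extremal pair with V(P_v, Q_v) = v (see the lead-in)
    p = 0.5 - (1.0 + alpha) * v / 6.0
    return (p, 1.0 - p), (p + v / 2.0, 1.0 - p - v / 2.0)

alpha, v = 1.5, 0.2
P, Q = binary_pair(alpha, v)
gain = renyi_gain(alpha, P, Q)
print(alpha_star, gain, 0.5 * alpha * v ** 2)
```

For $\alpha = 1.5$ the computed gain falls slightly below $\frac12 \alpha v^2$, as the negative sign of $1 + 5\alpha - 5\alpha^2$ predicts.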



4 Fourth-order inequalities 

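The inequality $D \ge \frac12 V^2 + \frac1{36} V^4$ is a consequence of the fact that
$$u - 1 - \log u \ge \frac12 \frac{(u - 1)^2}{1 + \frac23(u - 1)} + \frac1{36} \frac{(u - 1)^4}{[1 + \frac{28}{45}(u - 1)]^3} \tag{7}$$
for all $u > 0$, together with (3) and (4). To prove (7), let $g(u) = (u - 1 - \log u)[1 + \frac23(u - 1)][1 +
\frac{28}{45}(u - 1)]^3 - \frac12(u - 1)^2[1 + \frac{28}{45}(u - 1)]^3 - \frac1{36}(u - 1)^4[1 + \frac23(u - 1)]$ and use Lemma 5 after showing
that $g(1) = g'(1) = g''(1) = g'''(1) = g^{(4)}(1) = g^{(5)}(1) = 0$ and $g^{(6)}(u) = \frac{8}{91125}(43904u^4 - 50960u^3 +
44268u^2 - 34102u + 24565)/u^6$ is positive everywhere because
$$43904u^4 - 50960u^3 + 44268u^2 - 34102u + 24565 = 43904\Big(u - \frac{65}{224}\Big)^4 + \frac{88347}{4}\Big(u - \frac{1907903}{2827104}\Big)^2 + \frac{10273158845617}{723738624}. \tag{8}$$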
In this section we generalize the idea in the last paragraph to arbitrary $f$-divergences. In other
words, an inequality of the form $D_f \ge c_{2,f} V^2 + c_{4,f} V^4$ would be obtained if we can prove that
$$\tilde f(u) \ge c_{2,f} \frac{(u - 1)^2}{1 + (1 - w_{2,f})(u - 1)} + c_{4,f} \frac{(u - 1)^4}{[1 + (1 - w_{4,f})(u - 1)]^3} \tag{9}$$
for all $u > 0$. For this inequality to hold and be sharp enough so that $c_{2,f}$ and $c_{4,f}$ are best
possible, it is necessary that the Taylor expansions of both sides around $u = 1$ coincide up
to and including fifth-order terms. This condition implies the expressions of the $c_{i,f}$'s and $w_{i,f}$'s
($i = 2, 4$) in terms of the derivatives of $f$ at $u = 1$ given in the next theorem.



Theorem 7 (Fourth-order extended Pinsker inequality for $f$-divergences). Let $f$ and $\tilde f$
be as before and define $c_{2,f} = \frac{f''(1)}{2}$, $w_{2,f} = 1 + \frac13 \frac{f'''(1)}{f''(1)}$, $c_{4,f} = \frac1{72}\Big[3f^{(4)}(1) - 4\frac{f'''(1)^2}{f''(1)}\Big]$ and
$$w_{4,f} = 1 + \frac1{45}\, \frac{9f^{(5)}(1) - 20\, f'''(1)^3/f''(1)^2}{3f^{(4)}(1) - 4\, f'''(1)^2/f''(1)}.$$
Suppose that both $c_{2,f} > 0$ and $c_{4,f} > 0$ and that for every $u > 0$ we have
$$\tilde f(u)[1 + (1 - w_{2,f})(u - 1)][1 + (1 - w_{4,f})(u - 1)]^3 \ge c_{2,f}(u - 1)^2[1 + (1 - w_{4,f})(u - 1)]^3 + c_{4,f}(u - 1)^4[1 + (1 - w_{2,f})(u - 1)]. \tag{10}$$
Then $D_f \ge c_{2,f} V^2 + c_{4,f} V^4$. The coefficient $c_{4,f}$ is best possible.

Proof: We will prove first that (10) implies that $0 \le w_{i,f} \le 1$ ($i = 2, 4$). Reasoning by
contradiction, suppose first that $w_{2,f} \notin [0, 1]$, and evaluate both sides of (10) at $u = -w_{2,f}/(1 -
w_{2,f}) > 0$ to obtain that $0 \ge c_{2,f}(1 - w_{2,f})^{-5}(w_{4,f} - w_{2,f})^3$. Since $c_{2,f} > 0$, this implies that $w_{4,f} \ge
w_{2,f}$ when $w_{2,f} > 1$ and $w_{4,f} \le w_{2,f}$ when $w_{2,f} < 0$. Hence also $w_{4,f} \notin [0, 1]$ and we can evaluate
again both sides of (10), now at $u = -w_{4,f}/(1 - w_{4,f}) > 0$, to get that $0 \ge c_{4,f}(1 - w_{4,f})^{-5}(w_{2,f} - w_{4,f})$,
so that $w_{4,f} > 1$ implies now that $w_{2,f} \ge w_{4,f}$ while $w_{4,f} < 0$ implies that $w_{2,f} \le w_{4,f}$. Hence we
must have in any case that $w_{2,f}$ and $w_{4,f}$ are equal, say $w_{2,f} = w_{4,f} = w$, so that (10) becomes
$$\tilde f(u)[1 + (1 - w)(u - 1)]^4 \ge c_{2,f}(u - 1)^2[1 + (1 - w)(u - 1)]^3 + c_{4,f}(u - 1)^4[1 + (1 - w)(u - 1)]. \tag{11}$$
Since we are still assuming that $w_{2,f} = w \notin [0, 1]$, consider this inequality as $u \to -w/(1 - w) > 0$.
The left hand side is equivalent to $\tilde f[-w/(1 - w)](1 - w)^4[u + w/(1 - w)]^4$, i.e. a positive coefficient
times an infinitesimal term of order 4 in $[u + w/(1 - w)]$, while the right hand side is equivalent to
$c_{4,f}(1 - w)^{-3}[u + w/(1 - w)]$, i.e. an infinitesimal term of order 1 in $[u + w/(1 - w)]$. If $w > 1$ we
have that the principal part here is $c_{4,f}(1 - w)^{-3} < 0$ and hence (11) cannot hold as $u \uparrow -w/(1 - w)$,
while if $w < 0$ the principal part of the right hand side is $c_{4,f}(1 - w)^{-3} > 0$ and (11) cannot hold as
$u \downarrow -w/(1 - w)$. This contradiction is due to the assumption that $w_{2,f} \notin [0, 1]$. Similarly we prove
that $0 \le w_{4,f} \le 1$.

Now that we have proved that both $[1 + (1 - w_{2,f})(u - 1)]$ and $[1 + (1 - w_{4,f})(u - 1)]$ are positive
for $u > 0$, we observe that condition (10) implies (9) and hence that
$$D_f(P,Q) = D_{\tilde f}(P,Q) \ge c_{2,f} \int \frac{(q/p - 1)^2}{1 + (1 - w_{2,f})(q/p - 1)}\, p\, d\mu + c_{4,f} \int \frac{(q/p - 1)^4}{[1 + (1 - w_{4,f})(q/p - 1)]^3}\, p\, d\mu.$$
Hence, use respectively (3) and (4) to bound each term in the right hand side to obtain that
$D_f \ge c_{2,f} V^2 + c_{4,f} V^4$.

The proof that $c_{4,f}$ is best possible is similar to the last part of the proof of Theorem 3. Consider
again a binary space and for small $v$ define $P = (p, 1 - p)$ and $Q_v = (p + v/2, 1 - p - v/2)$, so that
$V(P,Q_v) = v$ and $D_f(P,Q_v) = p\tilde f(1 + v/2p) + (1 - p)\tilde f(1 - v/2(1 - p))$. Now let $p = \frac12 + \frac16 \frac{f'''(1)}{f''(1)} v$,
which is the value that minimizes the resulting coefficient of $v^4$,
and expand $D_f(P,Q_v)$ around $v = 0$ to obtain that $D_f(P,Q_v) = c_{2,f} v^2 + c_{4,f} v^4 + o(v^4)$. We leave
the details to the reader. □
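The coefficients of Theorem 7 depend only on the derivatives of $f$ at $u = 1$, so they can be computed exactly with rational arithmetic. The sketch below assumes the formulas for $c_{2,f}$, $w_{2,f}$, $c_{4,f}$ and $w_{4,f}$ as stated in the theorem; the helper names are illustrative and not part of the paper.

```python
from fractions import Fraction as F

def pinsker_coefficients(f2, f3, f4, f5):
    """Return (c2, w2, c4, w4) from f''(1), f'''(1), f^(4)(1), f^(5)(1)."""
    f2, f3, f4, f5 = map(F, (f2, f3, f4, f5))
    denom = 3 * f4 - 4 * f3 ** 2 / f2
    c2 = f2 / 2
    w2 = 1 + F(1, 3) * f3 / f2
    c4 = F(1, 72) * denom
    w4 = 1 + F(1, 45) * (9 * f5 - 20 * f3 ** 3 / f2 ** 2) / denom
    return c2, w2, c4, w4

def rel_info_derivatives(a):
    """Derivatives of orders 2..5 at u = 1 of f(u) = (u^a - 1) / (a (a - 1))."""
    a = F(a)
    return F(1), a - 2, (a - 2) * (a - 3), (a - 2) * (a - 3) * (a - 4)

# Information divergence: f(u) = u - 1 - log u has derivatives 1, -2, 6, -24 at u = 1
print(pinsker_coefficients(1, -2, 6, -24))
# Jeffreys' divergence: f(u) = (u - 1) log u has derivatives 2, -3, 8, -30 at u = 1
print(pinsker_coefficients(2, -3, 8, -30))
```

For the information divergence this gives $(\frac12, \frac13, \frac1{36}, \frac{17}{45})$, matching the constants $\frac23 = 1 - w_{2,f}$ and $\frac{28}{45} = 1 - w_{4,f}$ that appear in (7).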

The following condition on the derivatives of $f$ is usually easier to prove than (10).

Corollary 8 Let $f$, $\tilde f$, $c_{2,f}$, $w_{2,f}$, $c_{4,f}$ and $w_{4,f}$ be as before, and suppose that $f$ is six times
differentiable with $f''(u) > 0$ for all $u$. Then
$$\begin{aligned}
&\frac{f^{(6)}(u)}{f''(u)} [1 + (1 - w_{2,f})(u - 1)][1 + (1 - w_{4,f})(u - 1)]^3 \\
&\quad + 6\frac{f^{(5)}(u)}{f''(u)} [1 + (1 - w_{4,f})(u - 1)]^2 [4 - w_{2,f} - 3w_{4,f} + 4(1 - w_{2,f})(1 - w_{4,f})(u - 1)] \\
&\quad + 90\frac{f^{(4)}(u)}{f''(u)} (1 - w_{4,f})[1 + (1 - w_{4,f})(u - 1)][2 - w_{2,f} - w_{4,f} + 2(1 - w_{2,f})(1 - w_{4,f})(u - 1)] \\
&\quad + 120\frac{f'''(u)}{f''(u)} (1 - w_{4,f})^2 [4 - 3w_{2,f} - w_{4,f} + 4(1 - w_{2,f})(1 - w_{4,f})(u - 1)] \\
&\quad + 360(1 - w_{2,f})(1 - w_{4,f})^3 \ge 0
\end{aligned} \tag{12}$$
implies (10) and hence that $D_f \ge c_{2,f} V^2 + c_{4,f} V^4$.

Proof: Let $g(u) = \tilde f(u)[1 + (1 - w_{2,f})(u - 1)][1 + (1 - w_{4,f})(u - 1)]^3 - c_{2,f}(u - 1)^2[1 + (1 -
w_{4,f})(u - 1)]^3 - c_{4,f}(u - 1)^4[1 + (1 - w_{2,f})(u - 1)]$. Now it is straightforward, although rather tedious,
to check that $g(1) = g'(1) = g''(1) = g'''(1) = g^{(4)}(1) = g^{(5)}(1) = 0$, while $g^{(6)}(u)$ equals the left hand side
of (12) times $f''(u)$ (by Leibniz's rule, since the last two terms of $g$ are polynomials of degree 5)
and hence is nonnegative. Hence, Lemma 5 implies that $g(u) \ge 0$, which is
equivalent to (10). □

The case of Jeffreys' divergence ($f(u) = \tilde f(u) = (u - 1)\log u$) is interesting, because we have
that $c_{2,f} = 1$, $w_{2,f} = \frac12$, $c_{4,f} = \frac1{12}$, $w_{4,f} = \frac12$ and the left hand side of (12) is $\frac32(5u^4 - 8u^3 +
9u^2 - 8u + 5)/u^4 = \frac32[5(u - \frac25)^4 + \frac{21}{5}(u - \frac45)^2 + \frac{273}{125}]/u^4 > 0$. Hence $J \ge V^2 + \frac1{12} V^4$. We note
that from $D \ge \frac12 V^2 + \frac1{36} V^4$ it follows immediately, using the symmetry of $V$, that $J(P,Q) =
D(P,Q) + D(Q,P) \ge V^2 + \frac1{18} V^4$, but this bound is worse than the one we found using Corollary 8.
To finish this section we present the special case of the relative information and information gain
of order $\alpha$.

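Corollary 9 $D_{(\alpha)} \ge \frac12 V^2 + \frac1{72}(\alpha + 1)(2 - \alpha)V^4$ for $-1 \le \alpha \le 2$, $\alpha \ne 0, 1$. Also, $I_\alpha \ge \frac{\alpha}{2} V^2 +
\frac1{36}\alpha(1 + 5\alpha - 5\alpha^2)V^4$ for $0 < \alpha < 1$. In both cases the coefficients of $V^4$ are best possible.

Proof: We prove first the assertion for $D_{(\alpha)}$. We have $f(u) = [\alpha(\alpha - 1)]^{-1}(u^\alpha - 1)$ and then
$c_{2,f} = \frac12$, $w_{2,f} = \frac{\alpha + 1}{3}$, $c_{4,f} = \frac1{72}(\alpha + 1)(2 - \alpha)$, $w_{4,f} = \frac{11\alpha + 17}{45}$. A straightforward although
rather lengthy computation shows that the left hand side of (12) is
$$\begin{aligned}
\frac{(\alpha + 1)(2 - \alpha)}{273375\, u^4} \Big[ &-(\alpha - 5)(\alpha - 3)(\alpha - 4)(11\alpha + 17)^3 \\
&+ 2(\alpha - 3)(\alpha - 4)(22\alpha^2 - 28\alpha - 59)(11\alpha + 17)^2 u \\
&- 6(11\alpha + 17)(-28 + 11\alpha)(\alpha - 3)(\alpha + 2)(11\alpha^2 - 11\alpha - 31) u^2 \\
&+ 2(3 + \alpha)(\alpha + 2)(22\alpha^2 - 16\alpha - 65)(-28 + 11\alpha)^2 u^3 \\
&- (\alpha + 4)(3 + \alpha)(\alpha + 2)(-28 + 11\alpha)^3 u^4 \Big],
\end{aligned} \tag{13}$$
which we prove in the Appendix to be nonnegative for $u > 0$ when $-1 \le \alpha \le 2$.

Now, for Rényi's information gain we proceed as in the proof of Corollary 6 and use again that
$\log(1 + x) \ge \frac{2x}{2 + x}$ for $x \ge 0$, so that
$$\begin{aligned}
I_\alpha &= (1 - \alpha)^{-1} \log\Big[1 + \frac{\alpha(1 - \alpha)D_{(1-\alpha)}}{1 - \alpha(1 - \alpha)D_{(1-\alpha)}}\Big] \ge \frac{2\alpha D_{(1-\alpha)}}{2 - \alpha(1 - \alpha)D_{(1-\alpha)}} \\
&\ge \alpha D_{(1-\alpha)}\Big[1 + \frac12 \alpha(1 - \alpha)D_{(1-\alpha)}\Big] \\
&\ge \frac{\alpha}{2} V^2 + \frac{\alpha}{72}(\alpha + 1)(2 - \alpha)V^4 + \frac{\alpha^2(1 - \alpha)}{8} V^4 = \frac{\alpha}{2} V^2 + \frac1{36}\alpha(1 + 5\alpha - 5\alpha^2)V^4.
\end{aligned}$$
□

Using Corollary 9 we can obtain bounds for the Tsallis and Cressie-Read divergences, since they
can be expressed in terms of $D_{(\alpha)}$.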
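The bound of Corollary 9 for $D_{(\alpha)}$ can be spot-checked numerically on a two-point space, where $V = 2|p - q|$. The following is only a sanity check on a finite grid, not a proof; the grid size and the function names are assumptions made for this illustration.

```python
def rel_info_binary(alpha, p, q):
    """D_(alpha)(P, Q) for P = (p, 1 - p), Q = (q, 1 - q), alpha not in {0, 1}."""
    s = q ** alpha * p ** (1.0 - alpha) + (1.0 - q) ** alpha * (1.0 - p) ** (1.0 - alpha)
    return (s - 1.0) / (alpha * (alpha - 1.0))

def lower_bound(alpha, v):
    # Right hand side of Corollary 9: V^2 / 2 + (alpha + 1)(2 - alpha) V^4 / 72
    return 0.5 * v ** 2 + (alpha + 1.0) * (2.0 - alpha) * v ** 4 / 72.0

def min_slack(alpha, n=60):
    """Smallest value of D_(alpha) minus the bound over an n x n grid of binary pairs."""
    pts = [(i + 0.5) / n for i in range(n)]
    return min(rel_info_binary(alpha, p, q) - lower_bound(alpha, 2.0 * abs(p - q))
               for p in pts for q in pts)

for alpha in (-1.0, -0.5, 0.5, 1.5, 2.0):
    print(alpha, min_slack(alpha))
```

Up to rounding error, the smallest slack on the grid is zero (attained at $p = q$) for every tested $\alpha$ in $[-1, 2]$.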



5 A note about higher-order inequalities 

Unfortunately, the tools we have used so far are insufficient to obtain inequalities of the form
$D_f \ge c_{2,f} V^2 + c_{4,f} V^4 + \cdots + c_{2n,f} V^{2n}$ including terms of at least order 6. To understand why,
it is useful to restrict attention to the information divergence $D$. Our discussion at the beginning
of Sections 3 and 4 shows that the second and fourth-order inequalities for $D$ are due respectively
to the inequalities $u - 1 - \log u \ge \frac12 (u - 1)^2/[1 + \frac23(u - 1)]$ and (7), together with Corollary 2. Now, it is
straightforward to check that the difference between the left and the right hand sides of (7) equals
$\frac{41}{12150}(u - 1)^6 + O(|u - 1|^7)$. Indeed, we conjecture that
$$u - 1 - \log u \ge \frac12 \frac{(u - 1)^2}{1 + \frac23(u - 1)} + \frac1{36} \frac{(u - 1)^4}{[1 + \frac{28}{45}(u - 1)]^3} + \frac{41}{12150} \frac{(u - 1)^6}{[1 + \frac{23186}{38745}(u - 1)]^5}. \tag{14}$$
However, even if we could prove this assertion, we would then obtain only that $D \ge \frac12 V^2 + \frac1{36} V^4 +
\frac{41}{12150} V^6$. The coefficient $\frac{41}{12150}$, even if close, is smaller than the best possible $\frac1{270}$ found by Topsøe.
A possible explanation for this, or in other words, where the difference $\frac1{270} - \frac{41}{12150} = \frac{2}{6075}$
comes from, can be obtained by looking at the divergences which result from the right hand side of
(14). We have on one side that $\int \frac{(q/p - 1)^2}{1 + (2/3)(q/p - 1)}\, p\, d\mu \ge V^2(P,Q)$, with equality holding if and only if
$[1 + \frac23(q/p - 1)] \propto |q/p - 1|$ (actually, the proportionality constant must equal $1/V(P,Q)$). Similarly,
$\int \frac{(q/p - 1)^4}{[1 + (28/45)(q/p - 1)]^3}\, p\, d\mu \ge V^4(P,Q)$, with equality holding if and only if $[1 + \frac{28}{45}(q/p - 1)] \propto |q/p - 1|$.
Hence, it follows from these two inequalities that
$$\frac12 \int \frac{(q/p - 1)^2}{1 + (2/3)(q/p - 1)}\, p\, d\mu + \frac1{36} \int \frac{(q/p - 1)^4}{[1 + (28/45)(q/p - 1)]^3}\, p\, d\mu \ge \frac12 V^2(P,Q) + \frac1{36} V^4(P,Q), \tag{15}$$
but now equality cannot hold since, even if close, $\frac23 \ne \frac{28}{45}$. In fact, we conjecture that the infimum
of the left hand side of (15) taken over all $P$ and $Q$ such that $V(P,Q) = v$ equals $\frac12 v^2 + \frac1{36} v^4 +
(\frac1{270} - \frac{41}{12150})v^6 + o(v^6)$. We hope to be able to report on these issues soon.
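The sixth-order gap in (7) can be confirmed with exact Taylor coefficients in $t = u - 1$. The sketch below assumes only the series $u - 1 - \log u = \sum_{n \ge 2} (-1)^n t^n/n$ and the binomial series of $(1 + dt)^{-k}$; the function names are illustrative.

```python
from fractions import Fraction as F
from math import comb

def inverse_power_coeff(d, k, j):
    """Coefficient of t^j in (1 + d t)^(-k): (-1)^j C(k + j - 1, j) d^j."""
    return (-1) ** j * comb(k + j - 1, j) * d ** j

def rhs_coeff(n):
    """Coefficient of t^n in (1/2) t^2 / (1 + 2t/3) + (1/36) t^4 / (1 + 28t/45)^3."""
    total = F(0)
    if n >= 2:
        total += F(1, 2) * inverse_power_coeff(F(2, 3), 1, n - 2)
    if n >= 4:
        total += F(1, 36) * inverse_power_coeff(F(28, 45), 3, n - 4)
    return total

def lhs_coeff(n):
    """Coefficient of t^n in t - log(1 + t), for n >= 2."""
    return F((-1) ** n, n)

gap6 = lhs_coeff(6) - rhs_coeff(6)
print(gap6, F(1, 270) - gap6)
```

Both sides agree up to fifth order, the sixth-order gap is $\frac{41}{12150}$, and its distance to Topsøe's coefficient $\frac1{270}$ is $\frac{2}{6075}$.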



6 Appendix 

Before proving that (13) is nonnegative we state the following lemma. We have already used the idea
in the lemma to obtain the decomposition (8).

Lemma 10 Let $T(u) = c_4 u^4 + c_3 u^3 + c_2 u^2 + c_1 u + c_0$ and define $a_4 = c_4$, $a_2 = \frac18[8c_2 c_4 - 3c_3^2]/c_4$
and
$$a_0 = \frac{1}{256 c_4^3}\, \frac{2048 c_0 c_4^4 c_2 - 768 c_0 c_4^3 c_3^2 - 8 c_4 c_2 c_3^4 + c_3^6 + 64 c_1 c_3^3 c_4^2 - 512 c_1^2 c_4^4}{8 c_2 c_4 - 3 c_3^2}. \tag{16}$$
Then a sufficient condition for $T(u) \ge 0$ for every $u$ is that $a_4$, $a_2$ and $a_0$ are nonnegative.

Proof: Check that
$$T(u) = a_4\Big(u + \frac{c_3}{4c_4}\Big)^4 + a_2\Big(u + \frac14\, \frac{16 c_1 c_4^2 - c_3^3}{c_4(8 c_2 c_4 - 3 c_3^2)}\Big)^2 + a_0.$$
□
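The decomposition of Lemma 10 is mechanical and can be reproduced with exact rational arithmetic. In the sketch below the function name and the returned tuple layout are assumptions of this example; $a_0$ is computed directly as $c_0 - c_4 s^4 - a_2 t^2$, which is equivalent to (16).

```python
from fractions import Fraction as F

def quartic_decomposition(c4, c3, c2, c1, c0):
    """Return (s, a2, t, a0) such that T(u) = c4 (u + s)^4 + a2 (u + t)^2 + a0."""
    c4, c3, c2, c1, c0 = map(F, (c4, c3, c2, c1, c0))
    s = c3 / (4 * c4)
    a2 = (8 * c2 * c4 - 3 * c3 ** 2) / (8 * c4)
    t = (16 * c1 * c4 ** 2 - c3 ** 3) / (4 * c4 * (8 * c2 * c4 - 3 * c3 ** 2))
    a0 = c0 - c4 * s ** 4 - a2 * t ** 2
    return s, a2, t, a0

# Numerator of g^(6) for the information divergence, as in (8)
print(quartic_decomposition(43904, -50960, 44268, -34102, 24565))
# Quartic arising for Jeffreys' divergence in Section 4
print(quartic_decomposition(5, -8, 9, -8, 5))
```

Both calls reproduce the constants quoted in the text, and in each case $a_4$, $a_2$ and $a_0$ are positive.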

Proof that (13) is nonnegative: Let $-1 \le \alpha \le 2$, let $T(u) = T_\alpha(u)$ be the term between brackets in (13) and
define $c_i = c_i(\alpha)$ and $a_i = a_i(\alpha)$ as in Lemma 10, so that for instance $a_4 = c_4 = -(\alpha + 4)(\alpha +
3)(\alpha + 2)(11\alpha - 28)^3$, which is of course nonnegative. Next
$$a_2 = \frac92\, \frac{(980\alpha^3 + 552\alpha^2 - 4257\alpha - 4207)(\alpha + 2)(-28 + 11\alpha)}{\alpha + 4},$$
which is positive because
$$980\alpha^3 + 552\alpha^2 - 4257\alpha - 4207 = -(980\alpha + 1532)(2 - \alpha)(\alpha + 1) - (1143 + 765\alpha)$$
is negative. Using (16) we obtain that $a_0 = 9P_{10}(\alpha)/[32 a_2 (\alpha + 4)^4]$ where
$$\begin{aligned}
P_{10}(\alpha) = {} & -20792743232\alpha^{10} - 168248775872\alpha^9 + 54551858544\alpha^8 \\
& + 3066837388032\alpha^7 + 4844633801556\alpha^6 - 14799467270700\alpha^5 - 43681339670379\alpha^4 \\
& - 4381425810042\alpha^3 + 94728169651149\alpha^2 + 113143847999692\alpha + 41092635382468.
\end{aligned}$$
Hence, to conclude the proof we need to show that $a_0 \ge 0$ or equivalently that $P_{10}(\alpha) \ge 0$ whenever
$-1 \le \alpha \le 2$. This follows from the identity
$$\begin{aligned}
P_{10}(\alpha) = {} & (20792743232\alpha^2 + 189041519104\alpha + 300831606416)(2 - \alpha)^3(\alpha + 1)^5 \\
& + (1295259115248\alpha + 3335882569236)(2 - \alpha)^2(\alpha + 1)^4 \\
& + (1471491213228\alpha + 7953881034231)(2 - \alpha)(\alpha + 1)^3 \\
& + 1343948812407(2 - \alpha)(\alpha + 1)^2 \\
& + (4661891728632\alpha^2 + 11252369540556\alpha + 6746792560920),
\end{aligned} \tag{17}$$
since after examining the signs of the different factors it is possible to conclude that each term in
the sum is nonnegative. □
Remark. Checking that this last identity holds constitutes a formal proof of the fact that
$P_{10}(\alpha) \ge 0$ for every $-1 \le \alpha \le 2$. Explaining how we obtained it is a bit harder, especially since it
involves a "trial and error" process. Essentially, we try to divide $P_{10}(\alpha)$ by a polynomial $A(\alpha)$ of
degree $d$ which is known to be positive for the desired range. Hence we obtain that $P_{10}(\alpha) =
Q(\alpha)A(\alpha) + R(\alpha)$, where the degrees of $Q$ and $R$ are at most $(10 - d)$ and $(d - 1)$. A sufficient
condition for $P_{10}(\alpha) \ge 0$ is then that both $Q(\alpha)$ and $R(\alpha)$ are nonnegative for the desired range.
Since we have $-1 \le \alpha \le 2$, natural candidates for $A(\alpha)$ took the form $(2 - \alpha)^m(\alpha + 1)^n$. The
polynomial division can be made easily using a symbolic manipulation package (cf. the function quo
in MAPLE), while a plotting routine can make an initial assessment of whether the decomposition
was successful, i.e. whether both $Q$ and $R$ are nonnegative (if not, we would try again with different
$m$ and $n$). For instance, the first successful division made to arrive at (17) had $A(\alpha) = (2 - \alpha)^3(1 +
\alpha)^5$. In a sense this means that we replace a degree 10 problem (showing that $P_{10}$ is nonnegative)
by two problems having degrees 2 and 7 (showing respectively that $Q$ and $R$ are nonnegative).
The same procedure can be repeated for $Q$ and for $R$, and then for their respective quotients and
remainders, and so on until all polynomials involved are either of the form $(2 - \alpha)^m(\alpha + 1)^n$ or have at
most degree 2, so that their roots (hence their signs) can be obtained analytically. After doing all these
divisions it is easy to put everything back together into a single decomposition as in (17). Of
course, we have no guarantee that this procedure would work for any polynomial, but it worked
for $P_{10}$.